ABSTRACT 

Summary: Spoligotyping is a well-established genotyping technique 
based on the presence of unique DNA sequences in Mycobacterium 
tuberculosis (Mtb), the causal agent of tuberculosis disease (TB). 
Although advances in sequencing technologies are leading to 
whole-genome bacterial characterization, tens of thousands of isolates 
have been spoligotyped, giving a global view of Mtb strain diversity. To 
bridge the gap, we have developed SpolPred, a software to predict the 
spoligotype from raw sequence reads. Our approach is compared with 
experimentally and de novo assembly determined strain types in a set 
of 44 Mtb isolates. In silico and experimental results are identical for 
almost all isolates (39/44). However, SpolPred detected five experi- 
mentally false spoligotypes and was more accurate and faster than 
the assembling strategy. Application of SpolPred to an additional 
seven isolates with no laboratory data led to types that clustered 
with identical experimental types in a phylogenetic analysis using 
single-nucleotide polymorphisms. Our results demonstrate the useful- 
ness of the tool and its role in revealing experimental limitations. 
Availability and implementation: SpolPred is written in C and is avail- 
able from www.pathogenseq.org/spolpred. 
Contact: francesc.coll@lshtm.ac.uk 

Supplementary information: Supplementary data are available at 
Bioinformatics Online. 
1 INTRODUCTION 

Tuberculosis is an infectious disease caused by bacteria of the 
Mycobacterium tuberculosis (Mtb) complex. Genotyping techniques 
based on the presence of repetitive elements, conserved 
sequences, or loci with variable numbers of tandem repeats have 
been standardized (Kanduma et al., 2003), allowing the comparison 
of isolates between laboratories and regions worldwide. The 
popular spoligotyping approach (Kamerbeek et al., 1997) ex- 
ploits the polymorphism at the direct repeat (DR) locus of 
Mtb. It is based on the polymerase chain reaction (PCR) ampli- 
fication of 43 short unique sequences (termed spacers) found 
between well-conserved 36-bp DRs and the subsequent 
hybridization of the products onto a membrane with oligonucleotides 
complementary to each spacer. Since strains vary in 
the occurrence of particular spacers, each sample produces a 
distinctive spot pattern, which is then translated into a numerical 
code of 15 digits known as octal code (Dale et al., 2001). The 
web-based database SITVITWEB contains 2740 shared types or 
spoligotype international types (SITs) found among 58 180 clin- 
ical isolates, subsequently grouped into a list of 62 lineages/sub- 
lineages which normally show a geographic distribution (Demay 
et al., 2012). In silico genotyping approaches are required to 
bridge the gap between experimental and high-throughput 
sequencing, leading to the development of SpolPred, a software 
to predict the spoligotype from raw sequence reads. 
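
The translation from a 43-spacer hybridization pattern to the 15-digit octal code can be illustrated with a short sketch. It assumes the usual convention in which spacers 1-42 are read as 14 triplets (one octal digit each) and spacer 43 contributes a final digit of 0 or 1. The function name spacers_to_octal is hypothetical and is not part of SpolPred.

```python
def spacers_to_octal(present):
    """Convert a 43-element presence/absence pattern into a 15-digit octal code."""
    if len(present) != 43:
        raise ValueError("expected exactly 43 spacers")
    bits = [1 if p else 0 for p in present]
    digits = []
    # Spacers 1-42 are read as 14 triplets, each giving one octal digit.
    for i in range(0, 42, 3):
        a, b, c = bits[i:i + 3]
        digits.append(str(4 * a + 2 * b + c))
    # Spacer 43 stands alone and contributes a final 0 or 1.
    digits.append(str(bits[42]))
    return "".join(digits)


# Pattern with spacers 1-34 absent and spacers 35-43 present.
pattern = [False] * 34 + [True] * 9
print(spacers_to_octal(pattern))  # 000000000003771
```

The pattern used in the example (spacers 1-34 absent) is the one classically associated with the Beijing lineage.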

2 METHODS 

We have developed a C executable to predict the spoligotype octal code 
from files in fastq format. By making use of a 2-bit per nucleotide coding 
strategy to speed up performance, every 25-bp unique spacer is queried 
against each read allowing up to one mismatch (Ioerger et al., 2009). The 
spacer sequences chosen are the same as those used as probes in the 
original spoligotyping assay (Kamerbeek et al., 1997). The read length 
can be changed to support data from different (single or paired end, 
minimum 35 bp) sequencing platforms (see User Manual for more avail- 
able options). The appearance of all 43 queries is eventually translated 
into the octal code which is then matched to a spoligotype in 
SITVITWEB. The software has been applied to 51 Ugandan Mtb isolates 
which underwent sequencing using Illumina GAII 76-bp paired-end technology 
at the Sanger Institute. The number of paired reads varied per 
sample from 10 to 30 million. To evaluate accuracy, SpolPred-predicted 
results were compared with those experimentally determined for 44 samples 
in our laboratory. In addition, results were compared with those 
manually extracted from de novo genome assemblies, generated using 
Velvet (Zerbino and Birney, 2008) [default settings, except a k-mer 
word length of 51, insert length (300 bp) and its standard deviation (30 bp)]. 
Subsequently, the same 25-bp spacers were queried against the resulting 
contigs. A cluster dendrogram was constructed using 6998 
single-nucleotide polymorphisms (SNPs) to allow the investigation of seven 
isolates with no experimental data. 
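
The 2-bit encoding and one-mismatch query described above can be sketched as follows. This is an illustration under stated assumptions and not the SpolPred source: the function names are hypothetical, the spacer in the usage note is a made-up 25-bp sequence rather than a real probe, each window is re-packed instead of being updated incrementally, and only the orientation of the read as given is searched.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack an ACGT string into an integer, 2 bits per base.

    Returns None if the sequence contains any other letter (e.g. N).
    """
    value = 0
    for base in seq.upper():
        if base not in CODE:
            return None
        value = (value << 2) | CODE[base]
    return value


def mismatches(a, b, k):
    """Count differing bases between two packed k-mers."""
    x = a ^ b                       # non-zero 2-bit pairs mark mismatching bases
    low_bits = int("01" * k, 2)     # mask selecting the low bit of every pair
    return bin((x | (x >> 1)) & low_bits).count("1")


def read_contains(read, spacer, max_mismatch=1):
    """True if the spacer occurs in the read with at most max_mismatch substitutions."""
    k = len(spacer)
    target = pack(spacer)
    if target is None:
        raise ValueError("spacer must contain only A, C, G and T")
    for start in range(len(read) - k + 1):
        window = pack(read[start:start + k])
        if window is not None and mismatches(window, target, k) <= max_mismatch:
            return True
    return False
```

For example, with the hypothetical spacer "ACGTACGTTAGCCGATAGGCTTACA", read_contains("GGG" + spacer + "TTT", spacer) is True, and a read carrying the same sequence with a single substitution is still reported as a hit. Packing bases into integers lets the comparison of a whole 25-mer reduce to one XOR and a bit count, which is the motivation for the 2-bit strategy.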

3 RESULTS 

The software was tested on a 64-bit Ubuntu Linux computer 
with a 3.07 GHz processor and 8 GB of RAM. As expected, 
running time per fastq file increased proportionally with read 
coverage. Nevertheless, the number of reads processed per unit of time 
remained constant (~500 000 reads per minute). SpolPred-inferred 
SIT numbers matched the experimental ones for 39 samples 
(88.6%). The resulting octal codes from both in silico approaches 
were identical for 42 (95.5%) of the 44 samples with experimental 
results (see Supplementary Table S1 for detailed results). In the 
remaining two isolates, de novo assembly failed to detect spacer 
25, which happens to fall between contiguous assembled contigs. 
The five overall mismatches between in silico and experimental results 
were due to the increased in silico sensitivity in detecting 
spacer 15 in all five samples and, additionally, spacer 26 in one 
sample. When the original hybridization blots were checked, an 
irregular signal distribution for spacer 15 across all samples was 
noted. Some signals were either too faint or just not detectable to 
be manually assigned as being present. Although predicted spo- 
ligotypes remained unchanged for samples 26 and 48, the other 
three (25, 40 and 49), which had octal codes not previously 
reported in the SITVITWEB database, were re-assigned to different 
spoligotypes. Interestingly, these three isolates are consistently 
clustered in the SNP-based dendrogram, i.e. within a clade 
of samples having the same experimental type. Similarly, all sam- 
ples with no laboratory data were clustered with isolates of the 
same predicted spoligotype (Fig. 1). 
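
A comparison of experimental and predicted octal codes at the level of individual spacers, as discussed above, can be sketched by decoding each code back into its 43-spacer pattern. The function names are hypothetical and the two codes in the example are invented for illustration; they are not taken from the isolates in this study.

```python
def octal_to_spacers(code):
    """Decode a 15-digit octal code into a list of 43 presence (1) / absence (0) values."""
    if len(code) != 15 or any(ch not in "01234567" for ch in code[:14]) or code[14] not in "01":
        raise ValueError("expected 14 octal digits followed by 0 or 1")
    bits = []
    # Each of the first 14 digits encodes three consecutive spacers.
    for ch in code[:14]:
        d = int(ch)
        bits.extend([(d >> 2) & 1, (d >> 1) & 1, d & 1])
    # The last digit encodes spacer 43 on its own.
    bits.append(int(code[14]))
    return bits


def discordant_spacers(experimental, predicted):
    """Return the 1-based spacer numbers at which two octal codes disagree."""
    exp = octal_to_spacers(experimental)
    pred = octal_to_spacers(predicted)
    return [i + 1 for i, (e, p) in enumerate(zip(exp, pred)) if e != p]


# Invented codes that differ only in the fifth digit (spacers 13-15).
print(discordant_spacers("777767777777771", "777777777777771"))  # [15]
```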



4 DISCUSSION 

Although SNPs and other genetic variation derived from sequen- 
cing projects are likely to become the markers of choice due to 
their discriminatory power, PCR genotyping techniques are still 
widely employed. In this regard, SpolPred will enable the com- 
plementary comparison of computationally inferred spoligotypes 
with laboratory results. Currently, the tedious parameter opti- 
mization and computational requirements for de novo assembly 
are important constraints. Only 42 (95.5%) spoligotypes could 
be accurately inferred using the de novo strategy implemented. 
The region in the DR locus harbouring spacer 25 does not seem 
to be reconstructed in two genomes, resulting in an incorrect type 
classification. Importantly, in those five samples for which 
SpolPred and experimental patterns differ, predicted octal 
codes were exactly the same when using both computational 
approaches. Furthermore, the newly assigned spoligotypes 
(namely, isolates 25, 40 and 49 in Fig. 1) are clustered with 
other isolates having coincident experimental and in silico predicted 
lineages. The absent sequence responsible for the discrepancies 
observed, namely, spacer 15, was the same across all five 
problematic isolates. The ambiguous distinction of this spacer 
has been reported (Abadia et al., 2011) and explained in terms 
of the presence of a 4-nt deletion adjacent to the amplified se- 
quence (van Embden et al., 2000), which would not allow a 
proper primer hybridization. Other ambiguities caused by the 
insertion of IS6110 copies in the DR region have also been re- 
ported (Filliol et al., 2000). As demonstrated, the software can be 
employed to accurately and quickly confirm experimentally 
determined spoligotypes, infer them from sequenced isolates 
with no laboratory data and reveal unexpected cases of wrongly 
assigned types. Other causes of TB misclassification such as la- 
boratory cross contamination, PCR contamination or ambigu- 
ous hybridization patterns could also be clarified. With the 
amount of sequence data growing rapidly, software like 
SpolPred will be a useful addition to pipelines used to infer TB 
(e.g. MIRU-VNTR) and other bacterial strain types (e.g. MLST 
typing) and ultimately assist with disease control. 

18 [126,126] EAI5 
48 [1721,302] X1 
47 [302,302] X1 
10 [2356,2356] X1 
38 [451,451] T-H37RV 
46 [0,0] O 
28 [0,0] O 
41 [1,1] BEIJING 
33 [26,26] CAS1_DELHI 
14 [26,26] CAS1_DELHI 
12 [26,26] CAS1_DELHI 
13 [26,26] CAS1_DELHI 
21 [21,21] CAS1_KILI 
26 [1675,21] CAS1_KILI 
8 [288,288] CAS2 
49 [0,288] CAS2 
50 [288,288] CAS2 
15 [288,288] CAS2 
16 [-,288] CAS2 

Fig. 1. Dendrogram for 51 Ugandan isolates constructed using 6998 
SNPs, showing (from left to right): isolate number, [experimentally 
determined, SpolPred-determined SIT code] and SpolPred-inferred 
spoligotype; - indicates no laboratory data available 